Probabilistic Retrieval of OCR
نویسندگان
چکیده
The retrieval of OCR degraded text using n-gram formulations within a probabilistic retrieval system is examined in this paper. Direct retrieval of documents using n-gram databases of 2 and 3-grams or 2, 3, 4 and 5-grams resulted in improved retrieval performance over standard (word based) queries on the same data when a level of 10 percent degradation or worse was achieved. A second method of using n-grams to identify appropriate matching and near matching terms for query expansion which also performed better than using standard queries is also described. This method was less eeective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web based retrieval application using n-gram retrieval of OCR text and display, with query term highlighting, of the source document image is described.
منابع مشابه
Probabilistic Retrieval of OCR Degraded Text Using N-Grams
Abst rac t . The retrieval of OCR degraded text using n-gram formulations within a probabilistic retrieval system is examined in this paper. Direct retrieval of documents using mgram databases of 2 and 3-grams or 2, 3, 4 and 5-grams resulted in improved retrieval performance over standard (word based) queries on the same data when a level of 10 percent degradation or worse was achieved. A secon...
متن کاملA Content-based Probabilistic Correction Model for OCR Document Retrieval
The difficulty with information retrieval for OCR documents lies in the fact that OCR documents comprise of a significant amount of erroneous words and unfortunately most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to “boost...
متن کاملInformation retrieval for OCR documents: a content-based probabilistic correction model
The difficulty with information retrieval for OCR documents lies in the fact that OCR documents comprise of a significant amount of erroneous words and unfortunately most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to “boost...
متن کاملA probabilistic method for keyword retrieval in handwritten document images
Keyword retrieval in handwritten document images (word spotting) is very challenging given that OCR accuracy is not yet adequate for handwritten scripts, specially with large lexicons. Various proposed approaches build indices on information such as image features or OCR scores and have improved the performance of the traditional approach that builds index on OCR’ed text. In this paper, we impr...
متن کاملProbabilistic Retrieval Methods for Text with Miss-Recognized OCR Characters
This paper presents two probabilistic text retrieval methods speci cally designed to carry out a full-text search of Japanese documents containing OCR errors. By searching for any query term under the premise that errors exist in recognized text, the presented methods can tolerate such errors, and therefore manual post-editing is not required after OCR recognition. In the applied approach, conf...
متن کامل